Internet Info 1997 December

Internet Message Format | 1997-02-19 | 7KB

Received: (from daemon@localhost) by services.bunyip.com (8.6.10/8.6.9)
        id FAA21979 for urn-ietf-out; Thu, 14 Nov 1996 05:39:44 -0500
Received: from mocha.bunyip.com (mocha.Bunyip.Com [192.197.208.1])
        by services.bunyip.com (8.6.10/8.6.9) with SMTP id FAA21973
        for <urn-ietf@services.bunyip.com>; Thu, 14 Nov 1996 05:39:37 -0500
Received: from dicsmss1.jrc.it by mocha.bunyip.com with SMTP
        (5.65a/IDA-1.4.2b/CC-Guru-2b) id AA18908
        (mail destined for urn-ietf@services.bunyip.com);
        Thu, 14 Nov 96 05:39:29 -0500
Received: from jrc.it (elect6.jrc.it) by dicsmss1.jrc.it (4.1/EB-950131-C)
        id AA12722; Thu, 14 Nov 96 11:44:01 +0100
Received: by jrc.it (5.x/EB-950213-L) id AA01418;
        Thu, 14 Nov 1996 11:38:37 +0100
Date: Thu, 14 Nov 1996 11:38:37 +0100
From: Dirk vanGulik <Dirk.vanGulik@jrc.it>
Message-Id: <9611141038.AA01418@ jrc.it>
To: Dirk.vanGulik@jrc.it, mduerst@ifi.unizh.ch
Subject: Re: [URN] I18N does not belong in URNs
Cc: FisherM@is3.indy.tce.com, moore@cs.utk.edu, girod@LCS.MIT.EDU,
        tallen@fsc.fujitsu.com, urn-ietf@bunyip.com
X-Sun-Charset: US-ASCII
Sender: owner-urn-ietf@services.bunyip.com
Precedence: bulk
Reply-To: Dirk vanGulik <Dirk.vanGulik@jrc.it>
Errors-To: owner-urn-ietf@bunyip.com

> Dirk.vanGulik wrote:
> >Or they use a naming scheme dependent interpretation. You could for
> >example simply limit the representation of the URN to the glyphs
> >A-Z, 0-9 and say the dot, dash, and colon.
>
> Too much in favor of English users (+Hawaiian and Swahili)!

Well, as a Dutch person working in Italy in a Swiss building, I can
quite live with it :-) No, just kidding, but seriously: why does this
'favour' English users? And where does that come in?

> > I recently came across a specific scheme, say 'crdis' which has a lot of
> > LocalControlIdentifiers which can only be (fully) expressed in 2-byte octets.
> > This obviously gave problems as some of our z39.50 servers and HTTP/URN
> > resolving code did not quite like this.
> > A pragmatic solution was to simply base64 encode the identification
> > string. A specialist GUI could now base64-decode the string to arrive
> > at something meaningful for a human. But it does not have to.
> > And it does keep things simple.
> I have mentioned earlier the possibility that some namespaces may
> contain arbitrary data, as opposed to characters, and I have therefore

Hold on, I was trying to convey that these strings did _INDEED_ contain
meaningful text; actually the various 'strings' were short titles in
the 19 European languages, carefully protected. So we are not talking
about arbitrary strings here, although they *function* as an arbitrary
handle. But for the expert their actual content conveys more than just
a number. The same could be said of my phone number

    +39 332 78 0014

which tells a local here that it is me, living in Ispra (78), near
Varese (33*), in Italy (39). But for most people those digits make no
sense at all, nor do they have to. And my local phone technician might
even tell me more by just looking at it.

> suggested that the requirement that all URNs be UTF-8 should be
> relaxed. I have given the data: URL as an example. Here we have
> an obvious second example.
>
> >I really would like to roll up the charset discussion; I agree that for a lot
> >of schemes, in particular those to be grandfathered in, one will need very
> >flexible encodings.
> They need to be very flexible in the sense that they have to be
> able to accommodate a very large set of characters, and something
> else than characters in some cases.

Well, that is a requirement I do not find in the functional RQ RFC at
all. Should that be rewritten? I agree that publishers and maintainers
might like to 'overload' the content, but that is a different issue.

> But there is absolutely no reason to have arbitrary flexibility
> for those cases where indeed characters have to be represented.
> For this case, it is much easier to specify that UTF-8 be used
> (optionally or even mandatorily with %HH encoding).
> >But the internet transfer mechanisms are not quite up to
> >that yet. So I would suggest:
> > 1. Limit the URNs to just a few chars (along the lines of DNS). This
> >    also makes comparing URNs easy.
> As for the character set *representing* URNs, I can agree. This would
> basically mean that everything has to be %HH encoded.

Hmm, I think we have a slight terminology clash, much like the problems
we have had with the URL RFC. There are several layers, each with its
own dimensions. In URLs (correct me if I am wrong, Larry) this is solved
by saying that the actual URL is an octet stream. These 8-bit encoded
values are all there is. However, by 'pure coincidence' they can be
treated as indexes into a charset such as US-ASCII or Latin-1 and
actually yield something which humans can interpret quite easily. But,
for example, the first 5 values (say http:) are not the glyphs 'h', 't',
't', 'p' and ':' but the values 68 74 74 70 3a or 48 54 54 50 3a
(lower/upper case).

So what I was saying is that the URN is an octet stream, and the
allowed values are from 0x30 to 0x39, 0x41--0x5a etc., which happen to
represent indexes into Latin-1, UTF-8 or ASCII.

Now a clever administrator (and a clever GUI) can use something like
base64 to get a nice Unicode string in.

> > 2. Allow, or perhaps even force, each registered naming scheme
> >    to suggest a possible encoding/escape sequencing to derive
> >    nice human names from the URN. This can be used by the more
> >    advanced GUIs.
> The greatest majority of naming schemes will have identical problems,
> namely how to represent characters in namespaces. And a single and
> very simple GUI can provide nice human representations for these
> cases (modulo available fonts). And a single and widely applicable
> convention should be used by all naming schemes dealing with
> characters.
> > 3. And keep a few chars (say the %) in stock, for the future.
> > 4. And remember, one can always make something like a next generation
> >    URN+:das:asd which can only be transcribed properly using say UTF8.
> There is absolutely no need to wait any longer for UTF-8.

Well, I can agree, but not with the premise; I do not see the need for
the charset flexibility. I know people can get quite religious about
their names, so you have a point that we should be flexible enough to
accommodate them, but on the other hand it does make implementation
harder whilst the functionality does not increase a bit.

> >I know that such an encoding essentially wastes space; but that is a tradeoff
> >I am willing to make for simplified storage, encoding and comparing.
> All your requirements can be fulfilled by:
> - Allowing non-character namespaces to create their own non UTF-8 encodings
>   (as I have suggested previously).
> - Require that internally, all 8-bit octets resulting from UTF-8 encoding
>   of characters beyond ASCII (+some in ASCII) be encoded with
>   %HH, so that there is no 8-bit form of URNs.
> Although I have good reasons to think that the second point is not needed,
> I could live with it. But giving up a convention such as UTF-8 to map
> arbitrary characters in namespaces to URNs would be a great loss.

Well, I think I do not see those requirements; but perhaps we should
look at the RQ RFC again, to see what possibly is missing.

Dw.
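
The two encodings weighed in this exchange, base64 over a 2-byte
representation and %HH escaping of UTF-8 octets, might look roughly
like the Python sketch below. It is an illustration only: the sample
title, the function names, and the choice of UTF-16BE for the "2-byte"
identifiers are assumptions and do not come from the messages themselves.

```python
import base64
from urllib.parse import quote, unquote

# Hypothetical short title; the real 'crdis' identifiers are not shown in the thread.
title = "Überblick"


def to_base64_id(text: str) -> str:
    """Encode text as 2-byte units (UTF-16BE, an assumption), then base64.

    The result is plain ASCII, so it survives 7-bit transports, but note
    that the standard base64 alphabet also uses lowercase, '+', '/' and '='.
    """
    return base64.b64encode(text.encode("utf-16-be")).decode("ascii")


def from_base64_id(handle: str) -> str:
    """What a 'specialist GUI' could do to show a human-readable name."""
    return base64.b64decode(handle).decode("utf-16-be")


def to_percent_id(text: str) -> str:
    """UTF-8 encode the text and %HH-escape every octet that is not an
    ASCII letter, digit, or one of '_', '.', '-', '~'."""
    return quote(text, safe="")


handle = to_base64_id(title)
escaped = to_percent_id(title)

# The "octet stream" view of a scheme prefix: values, not glyphs.
lower_octets = "http:".encode("ascii").hex()  # '687474703a'
upper_octets = "HTTP:".encode("ascii").hex()  # '485454503a'
```

Either form can be compared octet by octet; the base64 handle stays
opaque unless decoded, while the %HH form leaves the ASCII letters
readable (here `escaped` is `%C3%9Cberblick`).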